I. INTRODUCTION

Multimedia search and retrieval has become an active research field thanks to the increasing
demand that accompanies many new practical applications. The applications include
large-scale multimedia search engines on the Web, media asset management systems in
corporations, audiovisual broadcast servers, and personal media servers for consumers.
Diverse requirements derived from these applications impose great challenges and incentives
for research in this field.

Application requirements and user needs often depend on the context and the application
scenarios. Professional users may want to find a specific piece of content (e.g., an
image) from a large collection within a tight deadline, and leisure users may want to
browse the clip art catalog to get a reasonable selection. Online users may want to filter
through a massive amount of information to receive information pertaining to their
interests only, whereas offline users may want to get informative summaries of selected
content from a large repository.

With the increasing interest from researchers and application developers, there have
been several major publications and conferences dedicated to surveying important advances
and open issues in this area. Given the dynamic nature of applications and research
in this broad area, any survey paper is also subject to the risk of being incomplete or
obsolete. With this perspective, we focus this chapter on several major emerging trends
in research as well as standards related to the general field of multimedia search and
retrieval. Our goal is to present some representative approaches and discuss issues that
require further fundamental studies.

Section II addresses a promising direction in integrating multimedia features in extracting
the syntactic and semantic structures in video. It introduces some domain-specific
techniques (e.g., news) combining analysis of audio, video, and text information in analyzing
content at multiple levels. Section III focuses on a complementary direction in which
visual objects and their features are analyzed and indexed in a comprehensive way. These 
approaches result in search tools that allow users to manipulate visual content directly to 
form multimedia queries. Section IV shows a new direction, incorporating knowledge 
from machine learning and interactive systems, to break the barriers of decoding semantics 
from multimedia content. Two complementary approaches are presented: the probabilistic 
graphic model and the semantic template. Section V covers an important trend in the 
multimedia content description standard, MPEG-7, and its impact on several applications 
such as an interoperable metasearch environment. 

II. VIDEO SEGMENTATION, INDEXING, AND BROWSING 

As discussed in the previous section, different methods are suitable for different contexts 
when people access large collections of information. Video content has unique characteris- 
tics that further affect the role of each access method. For example, the sequential browsing 
method may not be suitable for long video sequences. In this case, methods using content 
summaries, such as those based on the table of contents (ToC), are very useful in providing 
quick access to structured video content. 

Different approaches have been used to analyze the structures of video. One tech- 
nique is to do it manually, as in the cases of books (ToC) or broadcast news (closed 
captions) delivered by major American national broadcast news companies. Because man- 
ual generation of an index is very labor intensive and, thus, expensive, most sources of 
digital data in practice are still delivered without metainformation about content structures. 
Therefore, a desirable alternative is to develop automatic or semiautomatic techniques for 
extracting semantic structure from the linear data of video. The particular challenges we 
have to address include identification of the semantic structure embedded in multiple me- 
dia and discovery of the relationships among the structures across time and space (so that 
higher levels of categorization can be derived to facilitate further automated generation 
of a concise index table). 

Typically, to address this problem, a hierarchy with multiple layers of abstractions 
is needed. To generate this hierarchy, data processing in two directions has to be per- 
formed: first, hierarchically segmenting the given data into smaller retrievable data units, 
and second, hierarchically grouping different units into larger, yet meaningful, categories. 
In this section, we focus on issues in segmenting multimedia news broadcast data into 
retrievable units that are directly related to what users perceive as meaningful. The basic 
units after segmentation can be indexed and browsed with efficient algorithms and tools. 
The levels of abstraction include commercials, news stories, news introductions, and news 
summaries of the day. A particular focus is the development of a solution that effectively 
integrates the cues from video, audio, and text. 

Much research has concentrated on segmenting video streams into "shots" using
low-level visual features [1-3]. With such segmentation, the retrievable units are low- 
level structures such as clips of video represented by key frames. Although such systems 
provide significant reduction of data redundancy, better methods of organizing multimedia 
data are needed in order to support true content-based multimedia information archiving, 
search, and retrieval capabilities. The insufficiency associated with low-level visual
feature-based approaches is multifold. First, these low-level structures do not correspond in
a direct and convenient way to the underlying semantic structure of the content, making
it difficult for users to browse. Second, the large number of low-level structures generated
from the segmentation provides little improvement in browsing efficiency compared with
linear search. When users interact with an information retrieval system, they expect the
system to provide a concise characterization of the content available. The system should
allow users to construct clear, unambiguous queries and return requested information that
is long enough to be informative and as short as possible. Thus, multimedia processing
systems need to understand, to some extent, the content they handle, which conventional
"scene cut"-based systems do not. Therefore, more attention has been directed to
automatically recovering semantically meaningful structures from multimedia data [5-13].

A. Integrated Semantic Segmentation Using 
Multimedia Cues 

One direction in semantic-level segmentation focuses on detecting a set of meaningful
events, for example, extracting "touchdown" events from a football video.

Another approach generates a partition of the continuous data stream so that each segment
corresponds to a meaningful event such as a news story [4-12,15-17]. In other words, the
aim is to extract the overall structure of the data that reflects the original intention
of the information creator (such structure is lost when the data are being recorded on
the linear media). Some work along this line has made use of the information from a single
medium. With visual cues only, it is extremely difficult to recover the true semantic
structure. With text information only, because of the typical use of fixed processing
windows, it has been shown that obtaining a precise semantic boundary is difficult and
that the accuracy of segmentation varies with the window size. Other work integrates cues
from different media to achieve story segmentation in broadcast news [4,10-13]. By
combining cues, the story boundaries can be identified more precisely.

Some existing approaches rely on the known format of closed captions, using textual cues
to locate story boundaries; the work of Shahraray, for example, uses the closed-caption
mode information to segment commercials from stories. In other techniques, cues from audio,
visual, and text signals are utilized together. Audio features are used in separating news
and commercials; anchorperson segments are detected based on speaker recognition
techniques; story-level segmentation is achieved using text analysis that determines how
blocks of text should be merged to form news stories, individual story introductions, and
an overall news summary of the day. When different semantic units are extracted, they are
aligned in time with the media data so that a proper representation of these events can
be constructed across different media in such a way that the audiovisual presentation can
convey the semantics effectively [5,6,11,12].

In this section, we review an integrated solution for automated content structuring 
for broadcast news programs. We use this example to illustrate the development of tools 
for retrieving information from broadcast news programs in a semantically meaningful 
way at different levels of abstraction. 

A typical national news program consists of news and commercials. News consists 
of several headline stories, each of which is usually introduced and summarized by the 
anchor prior to and following the detailed reports by correspondents, quotations, and inter- 
views of newsmakers. 

Commercials are usually found between different news stories. With this observation,
we try to recover this content hierarchy by utilizing cues from different media whenever
it is appropriate. Figure 1 shows the hierarchy we intend to recover. In this hierarchy,
the lowest level contains the continuous multimedia data stream (audio, video, text). At 
the next level, we separate news from commercials. The news is then segmented into the 
anchorperson's speech and the speech of others. The intention of this step is to use the
recognized anchor's identity to hypothesize a set of story boundaries that consequently
partition the continuous text into adjacent blocks of text. Higher levels of semantic units 
can then be extracted by grouping the text blocks into news stories and news introductions. 
In turn, each news story can consist of either the story by itself or the story augmented 
by the anchorperson's introduction. Detailed semantic organization at the story level is 
shown in Figure 2. 
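
The way anchor segments can be used to hypothesize story boundaries may be illustrated with a minimal Python sketch. The `Segment` record, the label names, and the rule that every anchor segment opens a new block are illustrative assumptions made here; they are not the detection procedure of [11,12].

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start: float  # seconds from the beginning of the program
    end: float
    label: str    # hypothetical labels: "commercial", "anchor", "other"


def hypothesize_blocks(segments: List[Segment]) -> List[List[Segment]]:
    """Drop commercials, then open a new block at every anchor segment."""
    blocks: List[List[Segment]] = []
    for seg in sorted(segments, key=lambda s: s.start):
        if seg.label == "commercial":
            continue  # news/commercial separation happens one level below
        if seg.label == "anchor" or not blocks:
            blocks.append([])  # anchor speech marks a hypothesized boundary
        blocks[-1].append(seg)
    return blocks


program = [
    Segment(0, 20, "anchor"), Segment(20, 95, "other"),
    Segment(95, 155, "commercial"),
    Segment(155, 170, "anchor"), Segment(170, 260, "other"),
    Segment(260, 300, "other"),
]
blocks = hypothesize_blocks(program)
for i, block in enumerate(blocks, 1):
    print(i, [(s.label, s.start, s.end) for s in block])
```

Each resulting block is only a hypothesis; in the hierarchy described above, text analysis would still decide whether adjacent blocks are merged into one story or kept as an introduction.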

The remaining content in a news program after the commercials are removed con- 
sists of news segments that are extracted using the detected anchor's speech [11,12]. Figure 
2 illustrates what each news segment can be classified into and how segments can be merged
with others to form different semantics. Using duration information, each segment is
initially classified as either a story body or a news introduction [12].

Figure 1 Content hierarchy of broadcast news programs. (Lowest level: broadcast news
programs, across multiple media, linear in time, flat structure.)

Figure 2 Relationship among the semantic structures at the story level.

The news data are segmented into multiple layers in a hierarchy to meet different
needs. For instance, some users may want to retrieve a story directly, others may want
to decide first which story sounds interesting, and still others (e.g., in the advertising
sector) may have a totally different need: to monitor commercials of competitors in order
to produce a competitive commercial. This segmentation mechanism partitions the
broadcast data in different ways so that direct indices to the events of different interests
can be automatically established.

B. Representations and Browsing Tools 

When semantic structures are recovered and indexed, efficient tools are needed to present
the extracted semantics in a form that is compact, concise, easy to understand, and at the
same time visually pleasing. Now, we discuss the representation issue at three levels for
broadcast news data: (1) how to present the semantic structure of the broadcast news to
the users, (2) how to represent particular semantics based on the content of a news story,
and (3) how to form the representation for news summary of the day.

A commonly used presentation for semantic structure is in the form of a table of
contents. In addition, in order to give users a sense of time, we also use a streamline
representation for the semantic structure. Figure 3 shows one presentation for the semantic
structure of a news program. On the left side of the screen, different semantics are
categorized in the form of a table of contents (commercials, news, individual news stories,
etc.). It is in a familiar hierarchical fashion that indexes directly into the time-stamped
media data. Each item listed is color coded by an icon or a button. To play back a particular
item, a user simply clicks on the button for the desired item in this hierarchical table. At
the right of this interface is the streamline representation, where the time line runs from
left to right and top to bottom. The time line has two layers of categorization. The first
layer is event based (anchor's speech, others' speech, and commercials) and the second
layer is semantic based (stories, news introduction, and news summary of the day). Each
distinct section is marked by a different color and the overall color codes correspond to
the color codes used in the table of contents. The content categorized in this
representation is thus aligned with time.

Figure 3 Representation for extracted semantic structures.

To represent each news story, two forms are considered. One is static (StoryIcon)
and the other is dynamic (multimedia streaming). To form the static presentation of a 
news story, we automatically construct the representation for a story that is most relevant 
to the content of the underlying story. Textual and visual information is combined. Figure 
4 gives an example of this static representation for the story about the damage that El Nino 
caused in California. In Figure 4, the ToC remains on the left so that users can switch to 
a different selection at any time and the right portion is the static presentation of the story 
that is currently being chosen (story 3 in this example). In this static presentation, there are 
three parts: the upper left corner lists a set of keywords chosen automatically from the 
segmented story text based on the words' importance evaluated by their term frequency/ 
inverse document frequency (TF/IDF) scores, the right column displays the transcription 
of the story (which can be scrolled if users want to read the text), and the center part is the 
visual presentation of the story consisting of a number of key frame images chosen automati- 
cally from the video in a content-sensitive manner [12]. Notice that the keywords for each 
story are also listed next to each story item in the ToC so that users can get a feeling about 
the content of the story before they decide which story to browse. 
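
As a rough illustration of keyword selection by TF/IDF score, the following Python sketch ranks the words of one story against the other stories of the same program. The toy story texts, the particular tf and idf formulas, and the function name are assumptions made for this example; the weighting actually used in [12] may differ, and no stop-word removal is attempted.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_keywords(stories: Dict[str, List[str]], story_id: str, k: int = 3) -> List[str]:
    """Return the k highest-scoring words of one story.

    tf  = count of the word in the story / story length
    idf = log(number of stories / number of stories containing the word)
    """
    n_stories = len(stories)
    doc_freq: Counter = Counter()
    for words in stories.values():
        doc_freq.update(set(words))  # count each word once per story
    words = stories[story_id]
    counts = Counter(words)
    scores = {
        w: (c / len(words)) * math.log(n_stories / doc_freq[w])
        for w, c in counts.items()
    }
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


stories = {
    "story1": "storm damage california storm flood coast".split(),
    "story2": "senate vote budget tax vote".split(),
    "story3": "olympic skating medal california team".split(),
}
keywords = tfidf_keywords(stories, "story1")
print(keywords)
```

A word that is frequent in one story but also appears in other stories (here "california") is pushed down the list, which is the behavior the keyword display relies on.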
Figure 4 Visual representation for stories about El Nino.

We can see that the static representation is compact and semantically meaningful. In the
dynamic presentation of the story, the user can choose to listen to a particular section of
the story or click the button of that story in the ToC to play back the entire story. This
allows each story to be retrieved precisely.

Finally, we construct the representation for the news summary of the day. It is composed
of K images, where K is the number of headline stories detected on a particular
day. The K images are chosen so that they are the most important in each story, using
the same criterion as in the story representation [11,12]. Figure 5 gives the visual
presentation for the news summary of the day for the NBC Nightly News on February 12,
1998. From this presentation, a user can see immediately that there are six headline stories
on that particular day. Below the image for each story, the list of keywords
from that story is displayed in a flickering fashion (the interval of flickering ...) so
that users can get a sense of the story from the keywords.

Figure 5 Representation for news summary of the day.

The stories in this example include ... village, and (6) why taxpayers should pay for
empty government buildings. From these examples, the effectiveness of this storytelling
visual representation for the news summary is evident.



III. OBJECT-BASED SPATIO-TEMPORAL VISUAL 
SEARCH AND FILTERING 

An active research direction complementary to the preceding one using semantic-level 
structuring is the one that directly exploits low-level objects and their associated features 
in images or videos. An intuitive and popular approach is to segment and provide efficient 
indexes to salient objects in the images. Such segmentation processes can be implemented 
using automatic or semiautomatic tools [21,22]. Examples of salient objects may corre- 
spond to meaningful real-world objects such as houses, cars, and people or low-level 
image regions with uniform features such as color, texture, or shape. 

Several notable image-video search engines have been developed using this approach,
namely searching images or videos by example, by features, or by sketches.
Searching for images by examples or templates is probably the most classical method of
image search, especially in the domains of remote sensing and manufacturing. From an
interactive graphic interface, users select an image of interest, highlight image regions,
and specify the criteria needed to match the selected template. The matching criteria may
be based on intensity correlation or feature similarity between the template image and the 
target images. 
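
Matching by intensity correlation can be sketched as follows. Normalized cross-correlation is used here as one common form of intensity correlation; that choice, the tiny gray-level arrays, and the function names are assumptions for illustration rather than the method of any particular system cited above.

```python
import math
from typing import List, Tuple

Image = List[List[float]]


def ncc(patch: Image, template: Image) -> float:
    """Normalized cross-correlation of two equal-size gray-level patches, in [-1, 1]."""
    a = [v for row in patch for v in row]
    b = [v for row in template for v in row]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den if den > 0 else 0.0  # flat patches carry no pattern


def best_match(image: Image, template: Image) -> Tuple[int, int, float]:
    """Slide the template over the image; return (row, col, score) of the best position."""
    th, tw = len(template), len(template[0])
    best = (0, 0, -2.0)
    for r in range(len(image) - th + 1):
        for c in range(len(image[0]) - tw + 1):
            patch = [row[c:c + tw] for row in image[r:r + th]]
            score = ncc(patch, template)
            if score > best[2]:
                best = (r, c, score)
    return best


template = [[0, 9], [9, 0]]
image = [
    [1, 1, 1, 1],
    [1, 2, 20, 1],
    [1, 20, 2, 1],
    [1, 1, 1, 1],
]
match = best_match(image, template)
print(match)
```

Because the means are subtracted and the result is normalized, the score is insensitive to uniform brightness and contrast changes between the template and the target region.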
... structures of image regions ... the images or videos they have in mind
[21,26,27]. Users can specify the object color and the motion trail to find a video ...

A. Object-Based Video Segmentation and Feature 
Extraction 

... subsection. ... shots. A video shot has a consistent ... includes links to conceptual
abstractions of video objects.

Figure 6 Object-based ...

... have also been developed. Color features may include a single representative color,
color histograms, and color pairs. Texture features may include ... domain textures,
texture histograms, Tamura texture, and Laws texture [34]. Shape features may include
moments of different orders, polynomial approximation, and algebraic invariants. Motion
features may include the trajectories of the centroid of each object, with camera motion
compensation. The concept is to select distinctive features that can be used efficiently
and effectively. The final selection depends on the resources (computing and storage)
available and the application (e.g., object matching).

To measure the similarity between video objects, a distance such as the Mahalanobis
distance can be used ...

... visual paradigm using the ...

Images can also be searched by the spatial arrangement of regions and their locations.
Multiple regions can be queried by "joining" the queries based on the individual regions
in the query image ...
This video object search system achieves fully automatic object segmentation and
feature extraction at a low level. However, the results of segmentation remain at
a low level (e.g., image regions with uniform features) and usually do not correspond
well to the real-world physical objects. One way to solve this problem is to use some ...
In MPEG-4, for example, video objects can be ... transmitted separately. The video objects
in MPEG-4 usually correspond to the semantic entities ... region segmentation and feature
extraction can be ... objects) and thus allows more flexible search.



IV. SEMANTIC-LEVEL CONTENT CLASSIFICATION AND 
FILTERING 

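This section presents two complementary approaches, the probabilistic graphic model and
the semantic template, for identifying semantic concepts in multimedia data.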
A. Content Modeling Using Probabilistic Graphic
Models 

The first approach uses probabilistic graphic models to identify events, objects, and sites 
in multimedia streams by computing probabilities such as P(underwater AND shark |
segment of multimedia data). The basic idea is to estimate the parameters and structure 
of a model from a set of labeled multimedia training data. 

A multiject (multimedia object) has a semantic label and summarizes the time sequences
of low-level features of multiple modalities in the form of a probability P(semantic
label | multimedia sequence). Most multijects fall into one of three categories: sites,
objects, and events. Whereas some multijects are supported mainly by video (e.g., shark), 
others are supported mainly by audio (e.g., interior of traveling train) and still others are 
strongly supported by both audio and video (e.g., explosion). 

The lifetime of a multiject is the duration of multimedia input that is used to deter- 
mine its probability. In general, some multijects (e.g., family quarrel) will live longer than 
others (e.g., gunshot) and the multiject lives can overlap. For simplicity, we could break 
the multimedia into shots and within each shot fix the lifetimes of all multijects to the 
shot duration. Given the multiject probabilities, this leads to a simpler static inference 
problem within each shot. Although this approximation ignores event sequences within 
a shot (e.g., gunshot followed by family quarrel), it leaves plenty of room for useful 
inferences, because directors often use different shots to highlight changes in action and 
plot. 

We model each modality in a multiject with a hidden Markov model (HMM) [41]. 
We investigate what combinations of input features and HMM structure give sufficiently 
accurate models. For example, we found that in the case of the explosion multiject, a 
color histogram of the input frames and a three-state HMM give a reasonably accurate 
video model [42]. In contrast, a video model for a bird may require object detection and 
tracking. 
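
To make the role of a three-state HMM concrete, the sketch below scores two short sequences of quantized frame observations with the standard scaled forward algorithm. All numbers are invented for illustration: the two observation symbols (0 for an ordinary frame, 1 for a frame whose color histogram is dominated by bright red and orange), the left-to-right transition matrix, and the emission probabilities are assumptions and are not the trained model of [42].

```python
import math
from typing import List


def forward_log_likelihood(obs: List[int], start: List[float],
                           trans: List[List[float]], emit: List[List[float]]) -> float:
    """Scaled forward algorithm: returns log P(obs | HMM)."""
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate one step through the transitions, then weight by the emission.
            alpha = [sum(alpha[r] * trans[r][s] for r in range(n)) * emit[s][obs[t]]
                     for s in range(n)]
        scale = sum(alpha)
        log_lik += math.log(scale)          # accumulate log of the scaling factors
        alpha = [a / scale for a in alpha]  # renormalize to avoid underflow
    return log_lik


# Hypothetical left-to-right model: before flash -> flash -> after flash.
start = [1.0, 0.0, 0.0]
trans = [[0.8, 0.2, 0.0],
         [0.0, 0.7, 0.3],
         [0.0, 0.0, 1.0]]
emit = [[0.9, 0.1],   # state 0 mostly emits ordinary frames
        [0.1, 0.9],   # state 1 mostly emits bright red/orange frames
        [0.9, 0.1]]   # state 2 mostly emits ordinary frames

explosion_like = [0, 0, 1, 1, 0, 0]
flicker = [1, 0, 1, 0, 1, 0]
ll_explosion = forward_log_likelihood(explosion_like, start, trans, emit)
ll_flicker = forward_log_likelihood(flicker, start, trans, emit)
print(ll_explosion, ll_flicker)
```

A single sustained burst of bright frames fits the three-state structure far better than an alternating pattern with the same number of frames, which is the kind of temporal regularity such a model is meant to capture.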

Each HMM in a multiject summarizes the time sequence for its corresponding mo- 
dality. In order to summarize modalities, it is necessary to identify likely correspondences 
between events in the different modalities. For example, Figure 9a and b show the posterior 
probabilities that an explosion occurred in a movie clip at or before time t under an audio 
and a video HMM that were each trained on examples of explosions. In this case, the 
sound of the explosion begins roughly 0.15 sec (eight audio frames at 50 Hz or five video 
frames at 30 Hz) later than the video of the explosion. In other cases, we found that the 
audio and video events were more synchronized. To detect explosions accurately, the 
explosion multiject should be invariant to small time differences between the audio and 
video events. However, if the multiject is overly tolerant of the time difference, it may 
falsely detect nonexplosions. For example, a shot consisting of a pan from a sunrise to a 
waterfall will have a flash of bright red followed, after a large delay, by a thundering 
sound with plenty of white noise. Although the audio and video features separately match 
an explosion, the large time delay indicates that it is not an explosion. 
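
The trade-off described above can be made concrete with a small numerical sketch. It first checks the frame arithmetic quoted in the text (0.15 sec at 50 Hz and at 30 Hz) and then uses a Gaussian weight on the difference between the audio and video onset times as one possible way to tolerate small lags while rejecting large ones. The kernel width `sigma` and the example onset times are assumed values chosen only for illustration.

```python
import math

AUDIO_RATE_HZ = 50.0  # audio feature frames per second (from the text)
VIDEO_RATE_HZ = 30.0  # video frames per second (from the text)


def offset_in_frames(offset_sec: float, rate_hz: float) -> float:
    """Convert a time offset in seconds into a (fractional) number of frames."""
    return offset_sec * rate_hz


def coupling_weight(t_audio: float, t_video: float, sigma: float = 0.2) -> float:
    """Unnormalized Gaussian kernel on the onset-time difference (sigma in seconds)."""
    d = t_audio - t_video
    return math.exp(-0.5 * (d / sigma) ** 2)


audio_frames = offset_in_frames(0.15, AUDIO_RATE_HZ)  # about 7.5, i.e. roughly eight
video_frames = offset_in_frames(0.15, VIDEO_RATE_HZ)  # about 4.5, i.e. roughly five
small_lag = coupling_weight(10.15, 10.0)  # explosion: sound slightly after the flash
large_lag = coupling_weight(13.0, 10.0)   # sunrise followed much later by a waterfall
print(audio_frames, video_frames, small_lag, large_lag)
```

With these assumed values the 0.15 sec lag keeps most of its weight, whereas a delay of several seconds is suppressed almost completely; a wider `sigma` would make the detector more tolerant and more prone to false detections.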

We have used two methods [42,43] for summarizing the modalities in a multiject. 
In the first method, each HMM models an event in a different modality and the times at 
which the event occurs in the different modalities are loosely tied together by a kernel 
(e.g., Gaussian). An example of a pair of such event-coupled HMMs is shown in Figure 
9c, where t_A and t_V are the times at which the explosion event begins in the audio and
video. In Reference 42, we discussed an efficient algorithm for computing the probability
of the multiject, e.g., P(explosion | video sequence, audio sequence). In the second method,
a high-level HMM is trained on the states of the HMMs for the different modalities.
Using a fast, greedy, bottom-up algorithm, the probability of the multiject can be
approximated. The goal is to devise a multiject representation that is simple enough for
learning and inference while being powerful enough to answer useful queries.

Figure 9 (a and b) Posterior probabilities that an explosion occurred in a movie clip at or
before time t under an audio and a video HMM. (c) A graphic model for a multiject (multimedia
object) that couples together the two HMMs.

Suppose we are interested in automatically identifying movie clips of exotic birds
from a movie library. We could compute the probability of the bird multiject for each
multimedia shot in the library and then rank the shots according to these probabilities.
However, by computing the probabilities of other related multijects, we may be able to
obtain additional support for the bird multiject. For example, an active waterfall multiject
provides evidence of an exotic location where we are more likely to find an exotic bird.
As a contrasting example, an active underwater multiject decreases the support for a bird
multiject.

We use the multinet (multiject network) to represent such dependences between multijects.
Figure 10 shows an example of a multinet. In general, all multijects derive probabilistic
support from the observed multimedia data (directed edges) as described in the previous
section. The multijects are further interconnected to form a graphical probability model.
Initially, we will investigate models [44] that associate a real-valued weight with each
undirected edge in the graph. The weights indicate to what degree two multijects are
correlated a priori (before the data are observed). In Figure 10, plus signs indicate that
the multijects are correlated a priori, whereas minus signs indicate that the multijects
are anticorrelated.
Figure 10 A multinet (multimedia network) probabilistically links multijects (multimedia objects)
to the data and also describes the probabilistic relationships between the multijects. Plus signs
indicate the multijects are correlated a priori (before the data are observed); minus signs
indicate the multijects are anticorrelated.

Returning to the bird example, a plus sign on the connection between the bird 
multiject and the waterfall multiject indicates that the two multijects are somewhat likely 
to be present simultaneously. The minus sign on the connection between the bird multiject 
and the underwater multiject indicates that the two multijects are unlikely to be present 
simultaneously. The graphical formulation highlights interesting second-order effects. For 
example, an active waterfall multiject supports the underwater multiject, but these two 
multijects have opposite effects on the bird multiject. 

In general, exact inference such as computing P(bird | multimedia data) in a richly
connected graphical model is intractable. The second-order effects just described imply
that many different combinations of multijects need to be considered to find likely
combinations. However, there has been promising work in applying approximate inference
techniques to such intractable networks in areas including pattern classification, unsupervised
learning, data compression, and digital communication [45]. In addition, approaches using 
Markov chain Monte Carlo methods [44] and variational techniques [46] are promising. 
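
For a network of only three multijects, exact inference is still feasible by enumerating every on/off combination, which makes the effect of the signed edges easy to see. The sketch below assumes a simple pairwise model in which the unnormalized probability of a state is the exponential of the sum of its active evidence terms and edge weights; the model form, the weight values, and the evidence values are all hypothetical and serve only to illustrate the bird, waterfall, and underwater example.

```python
import itertools
import math
from typing import Dict, Tuple

NODES = ("bird", "waterfall", "underwater")

# Signed edge weights (hypothetical values): positive = correlated a priori,
# negative = anticorrelated a priori.
EDGES: Dict[Tuple[str, str], float] = {
    ("bird", "waterfall"): 1.0,
    ("bird", "underwater"): -2.0,
    ("waterfall", "underwater"): 0.5,
}


def marginal(target: str, evidence: Dict[str, float]) -> float:
    """P(target is present) by brute-force enumeration of all 2^n states.

    evidence[node] is a bias standing in for the support the observed
    multimedia data give to that multiject.
    """
    total = 0.0
    on = 0.0
    for bits in itertools.product((0, 1), repeat=len(NODES)):
        state = dict(zip(NODES, bits))
        score = sum(evidence.get(n, 0.0) * state[n] for n in NODES)
        score += sum(w * state[a] * state[b] for (a, b), w in EDGES.items())
        p = math.exp(score)
        total += p
        if state[target]:
            on += p
    return on / total


p_base = marginal("bird", {})
p_waterfall = marginal("bird", {"waterfall": 2.0})
p_underwater = marginal("bird", {"underwater": 2.0})
print(p_base, p_waterfall, p_underwater)
```

The loop visits 2^n states, so the same code becomes unusable for a richly connected network with many multijects, which is why the approximate techniques mentioned above are of interest.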



B. Indexing Multimedia with Semantic Templates 

In this subsection, we discuss a different semantic-level video indexing technique, seman- 
tic templates (STs) [47]. Semantic templates associate a set of exemplar queries with each 
semantic. Each query in the template has been chosen because it has been successful at 
retrieving the concept. The idea is that because a single successful query rarely completely 
represents the information that the user seeks, it is better to cover the concept using a set 
of successful queries. Semantic templates can be defined over indexable media of any 
type. Here we focus on video, i.e., semantic visual templates (SVTs). Figure 11 shows example
SVTs for the "high jumper" concept and the "sunsets" concept. 

Figure 11 A semantic-level search paradigm using semantic visual templates. Image icons shown
are subsets of optimal templates for each concept [(a) "sunset" (b) "high jumper"] generated
through the two-way interactive system.

The goal of the ST is similar to that of multijects: to detect and recognize video
objects, sites, or events at the semantic level. However, the approach is based on different
but synergistic principles. The generation of a semantic template involves no labeled
ground truth data. What we do require is that some positive examples of the concept be
present in the database so that they can be used in the interactive process when users
interact with the system to generate the optimal set of STs. Development of STs utilizes
the following unique principles.

Two-way learning: The template generation system emphasizes the two-way learn- 
ing between the human and the machine. Because the human being is the final 
arbiter of the "correctness" of the concept, it is essential to keep the user in 
the template generation loop. The user defines the video templates for a spe- 
cific concept with the concept in mind. Using the returned results and rele- 
vance feedback, the user and the system converge on a small set of queries 
that best match (i.e., provide maximal recall of) the user's concept. 

Intuitive models: Semantic templates are intuitive, understandable models for semantic
concepts in the videos. The final sets of SVTs can be easily viewed
by the user. Users can directly access and manipulate any template
in the library.

Synthesizing new concepts: Different STs can be graphically combined to synthesize
more complex templates. For example, templates for high jumpers and crowds
can be combined to form a new template for "high jumpers in front of a
crowd." The audio templates for "crowd" and "ocean sounds" and the visual
template for "beach" can be combined to form the "crowds at a beach"
template.

The template framework is an extended model of the object-oriented video search 
engine described in Section III. The video object database consists of video objects and 
their features extracted in the object segmentation process. A visual template may consist 
of two types of concept definitions: object icons and example scenes/objects. The object 
icons are animated sketches such as the ones used in VideoQ. In VideoQ, the features 
associated with each object and their spatial and temporal relationships are important. The 
example scenes or objects are represented by the feature vectors extracted from these 
scenes/objects. Typical examples of feature vectors that could be part of a template are 
histograms, texture information, and structural information (i.e., more or less the global 
characteristics of the example scenes). The choice between an icon-based realization and 
an example-based realization depends on the semantic that we wish to represent. For 
example, a "sunset" can be very well represented by using a couple of objects, whereas 
a waterfall or a crowd is better represented using example scenes characterized by a global
feature set. Hence, each template contains multiple icons and example scenes/objects to 
represent the idea. The elements of the set can overlap in their coverage. The goal is to 
come up with a minimal template set with maximal coverage. 
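
The goal of a minimal template set with maximal coverage resembles a set cover problem. The sketch below uses the standard greedy set cover heuristic, which is a technique substituted here for illustration and not the interactive procedure of [47]; the query names and the sets of retrieved positive examples are hypothetical. The greedy choice gives a small set but is not guaranteed to give the smallest one.

```python
from typing import Dict, List, Set


def greedy_template_set(coverage: Dict[str, Set[str]], targets: Set[str]) -> List[str]:
    """Greedy set cover: repeatedly pick the query that retrieves the most
    positive examples not yet covered by the queries chosen so far."""
    remaining = set(targets)
    chosen: List[str] = []
    while remaining:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(coverage), key=lambda q: len(coverage[q] & remaining))
        gained = coverage[best] & remaining
        if not gained:
            break  # no query retrieves any of the remaining positives
        chosen.append(best)
        remaining -= gained
    return chosen


# Hypothetical exemplar queries and the positive videos each one retrieves.
coverage = {
    "icon_sun_over_horizon": {"v1", "v2", "v3"},
    "icon_orange_sky": {"v3", "v4"},
    "scene_example_7": {"v2", "v3"},
    "scene_example_9": {"v5"},
}
positives = {"v1", "v2", "v3", "v4", "v5"}
template_set = greedy_template_set(coverage, positives)
print(template_set)
```

In this toy case the overlapping query `scene_example_7` is never selected, because everything it retrieves is already covered by a query with wider coverage.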

Each icon for the concept comprises multiple objects that are associated with a set 
of visual attributes. The relevance of each attribute and each object to the concept is 
also specified using a context specification questionnaire. For example, for the concept
"sunsets," color and spatial structures of the objects (sun and sky) are more relevant.
The object "sun" may be nonmandatory because some sunset videos may not have the 
sun visible. For the concept "high jumper," the motion attribute of the foreground object 
(mandatory) and the texture attribute of the background object (nonmandatory) are more 
relevant than other attributes. 

Development and application of semantic templates require the following compo- 
nents: 

Generation: This is used to generate STs for each semantic concept. We will de- 
scribe an interactive learning system in which users can interactively define 
their customized STs for a specific concept. 

Metric: This is used to measure the "fitness" of each ST in modeling the concept
associated with the video shot or the video object. The fitness measure can 
be modeled by the spatiotemporal similarity between the ST and the video. 

Applications: An important challenge in applications is to develop a library of se- 
mantic concepts that can be used to facilitate video query at the semantic level. 
We will describe the proposed approaches to achieving such systems later. 

Automatic generation of the STs is a hard problem. Hence we use a two-way interaction
between the user and system in order to generate the templates (shown in Figure 12).
In our method, given the initial query scenario and using relevance feedback, the system
converges on a small set of icons (exemplar queries for both audio and video) that gives
us maximum recall. We now explain the mechanisms for generation of semantic visual
templates.

Figure 12 System architecture for semantic visual template.

The user comes to the system and sketches out the concept for which he wishes 
to generate a template. The sketch consists of several objects with spatial and temporal 
constraints. The user can also specify whether or not the object is mandatory. Each object 
is composed of several features. The user also assigns relevance weights to each feature 
of each object. This is the initial query scenario that the user provides to the system. 

The initial query can also be viewed as a point in a high-dimensional feature space. 
Clearly, we can also map all videos in the database in this feature space. Now, in order 
to generate the possible icon set automatically, we need to make jumps in each of the 
features for each object. Before we do so, we must determine the jump step size, i.e., 
quantize the space. This we do with the help of the weight that the user has input along 
with the initial query. This weight can be thought of as the user's belief in the relevance 
of the feature with respect to the object to which it is attached. Hence, a low weight gives 
rise to coarse quantization of the feature and vice versa. 
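
The link between the relevance weight and the jump step size can be sketched as follows.
This is a minimal illustration under stated assumptions: weights are taken to lie in
[0, 1], the number of quantization levels is assumed to grow linearly with the weight, and
the function names and the `max_levels` parameter are hypothetical. The text does not give
the actual mapping.

```python
from typing import List


def quantization_levels(weight: float, max_levels: int = 8) -> int:
    """Number of quantization bins for a feature: higher weight, finer quantization."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return max(1, round(weight * max_levels))


def candidate_values(lo: float, hi: float, weight: float, max_levels: int = 8) -> List[float]:
    """Bin centers that the system could 'jump' to for one feature of one object."""
    n = quantization_levels(weight, max_levels)
    step = (hi - lo) / n  # the jump step size
    return [lo + (i + 0.5) * step for i in range(n)]


# A weakly relevant feature is quantized coarsely, a highly relevant one finely.
print(candidate_values(0.0, 1.0, weight=0.25))
print(len(candidate_values(0.0, 1.0, weight=1.0)))
```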

Because the total number of icons possible using this technique increases very rap- 
idly, we do not allow for joint variation of the features. For each feature in each object, 
the user picks a plausible set for that feature. The system then performs a join operation 
on the set of features associated with the object. The user then picks the joins that are 
most likely to represent variations of the object. This results in a candidate icon list. 

In a multiple-object case, we do an additional join with respect to the candidate lists 
for each object. Now, as before, the user picks the plausible scenarios. After we have 
generated a list of plausible scenarios, we query the system using the icons the user has 
picked. Using relevance feedback on the returned results (the user labels the returned 
results as positive or negative), we then determine the icons that provide us with maximum 
recall. 
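
The join step and the recall-driven choice of icons can be sketched as below. The join is
modeled as a Cartesian product of the per-feature sets the user picked. The selection rule
is an assumption: a greedy pass that repeatedly keeps the icon whose query returned the
most positively labeled shots not yet covered. The text states only that relevance
feedback determines the icons giving maximum recall, so the greedy rule and all names here
are hypothetical.

```python
from itertools import product
from typing import Dict, List, Set, Tuple


def join_features(plausible: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Combine the user-picked values of each feature into candidate icons."""
    names = sorted(plausible)
    return [dict(zip(names, combo)) for combo in product(*(plausible[n] for n in names))]


def select_icons(positives_by_icon: Dict[str, Set[int]],
                 max_icons: int) -> Tuple[List[str], Set[int]]:
    """Greedily pick icons that add the most not-yet-covered positive shots."""
    covered: Set[int] = set()
    chosen: List[str] = []
    remaining = dict(positives_by_icon)
    while remaining and len(chosen) < max_icons:
        # Iterating over sorted names makes ties resolve deterministically.
        best = max(sorted(remaining), key=lambda icon: len(remaining[icon] - covered))
        gain = remaining[best] - covered
        if not gain:
            break  # no remaining icon improves recall
        chosen.append(best)
        covered |= gain
        del remaining[best]
    return chosen, covered


# Candidate icons for one object, then feedback-driven selection (toy data).
skier_icons = join_features({"color": ["red", "blue"], "trajectory": ["zigzag"]})
feedback = {"icon_a": {1, 2, 3}, "icon_b": {3, 4}, "icon_c": {1, 2}}
chosen, covered = select_icons(feedback, max_icons=2)
print(skier_icons)
print(chosen, sorted(covered))
```

Recall for the chosen set is then the number of covered positive shots divided by the
number of relevant shots in the database.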

We now discuss a detailed example showing the generation mechanism for creating 
the semantic visual template for slalom skiers. 

We begin the procedure by answering the context questionnaire shown in Figure 
13a. We label the semantic visual template "slalom." We specify that the 
query is object based and will be composed of two objects. Then (Fig. 13b), 
we sketch the query. The large, white background object is the ski slope and 
the smaller foreground object is the skier with its characteristic zigzag motion 
trail. 

We assign maximum relevance weights to all the features associated with the back- 
ground and skier. We also specify that the features belonging to the back- 
ground will remain static while those of the skier can vary during template 
generation. Then the system automatically generates a set of test icons, and 
we select plausible feature variations in the skier's color and motion trajectory. 

A set of potential icons including both the background and the foreground skier are 
shown in Figure 13c. The user then chooses a candidate set to query the sys- 
tem. The 20 closest video shots are retrieved for each query. The user provides 
relevance feedback, which guides the system to a small set of exemplar icons 
associated with slalom skiers. 

Figure 13 Example of semantic template development for "slalom": (a) questionnaire for
users to define the objects and features in a template; (b) initial graphic definition of
the template; and (c) candidate icons generated by the system.

As mentioned earlier, the framework of the semantic templates can be applied to
multiple media modalities including audio and video. Once we have the audio and the
visual templates for different concepts such as "skiing" and "sunsets," the user can
interact with the system at the concept level. The user can compose a new multimedia 
concept that is built using these templates. For example, if the user wanted to retrieve a 
group of people playing beach volleyball, he would use the visual templates of beach 
volleyball and beach sounds to generate a new query: { {Video: Beach volleyball, Beach}, 
{Audio: Beach Sounds} }. Then, the system would search for each template and return a 
result based on the user's search criterion. For example, he may indicate that he needs 
only some of the templates to be matched or that he needs all templates to be matched. 
Also, once we have a collection of audio and video templates for a list of semantics, we 
use these templates to match with the new videos and thereby generate a list of potential 
audio and video semantics (thereby generating a semantic index) that are associated with 
the video clip. Early results on querying using SVTs indicate that the concept works well in 
practice. For example, in the case of sunsets, the original query over a large heterogeneous 
database yielded only 10% recall. Using eight icons for the template, we boosted this 
result to 50%. 
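
The concept-level query above, with its choice between matching some templates and
matching all of them, can be sketched as set operations over the clips returned for each
template. The function name, the `mode` parameter, and the clip identifiers are
hypothetical. The text does not specify how the per-template results are combined.

```python
from typing import Dict, Set


def match_concept(template_hits: Dict[str, Set[int]], mode: str = "all") -> Set[int]:
    """Combine per-template results: 'all' requires every template, 'any' at least one."""
    hit_sets = list(template_hits.values())
    if not hit_sets:
        return set()
    if mode == "all":
        return set.intersection(*hit_sets)
    if mode == "any":
        return set.union(*hit_sets)
    raise ValueError("mode must be 'all' or 'any'")


# { {Video: Beach volleyball, Beach}, {Audio: Beach Sounds} } with toy clip ids.
hits = {
    "video:beach volleyball": {1, 2, 3},
    "video:beach": {2, 3, 4},
    "audio:beach sounds": {3, 5},
}
print(sorted(match_concept(hits, "all")))
print(sorted(match_concept(hits, "any")))
```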

V. INTEROPERABLE CONTENT DESCRIPTION
SCHEMES AND METASEARCH ENGINES

Techniques discussed earlier contribute to the state of the art in multimedia search and
retrieval. Efficient tools and systems have been developed at different levels, including
the physical level (e.g., image object matching) and the semantic level (e.g., news video
content structure and semantic labeling). These tools can be optimized for the maximum
power in specialized application domains. However, in many cases customized techniques
may be used by different service providers for specialized content collections.
Interoperability among the content indexes and the search functions becomes a critical
issue. How do we develop a transparent search and retrieval gateway that hides the
proprietary indexes and search methods and lets users access content in heterogeneous
content sources?

A. Content-Describing Schemes and MPEG-7 

To describe various types of multimedia information, the emerging MPEG-7 standard [48]
has the objective of specifying a standard set of descriptors as well as description schemes
(DSs) for the structure of descriptors and their relationships. This description (i.e., the
combination of descriptors and description schemes) will be associated with the content
itself to allow fast and efficient searching for material of a user's interest. MPEG-7 will
also standardize a language to specify description schemes (i.e., a description definition
language, DDL) and the schemes for encoding the descriptions of multimedia content.

In this section, we briefly describe a candidate interoperable content description
scheme [49] and some related research on image metasearch engines [50]. The motivation
for using these content description schemes for multimedia can be explained with the
following scenarios.

Distributed processing: The self-describing schemes will provide the ability to
interchange descriptions of audiovisual material independently of any platform,
any vendor, and any application. The self-describing schemes will enable the
distributed processing of multimedia content. This standard for interoperable
content descriptions will mean that data from a variety of sources can be easily
plugged into a variety of distributed applications such as multimedia processors,
editors, retrieval systems, and filtering agents.

Content exchange: A second scenario that will greatly benefit from an interoperable 
content description is the exchange of multimedia content among heteroge- 
neous audiovisual databases. The content descriptions will provide the means 
to express, exchange, translate, and reuse existing descriptions of audiovisual 
material. 

Customized views: Finally, multimedia players and viewers compliant with the mul- 
timedia description standard will provide the users with innovative capabilities 
such as multiple views of the data configured by the user. The user could 
change the display's configuration without requiring the data to be down-
loaded again in a different format from the content broadcaster.

To ensure maximum interoperability and flexibility, our description schemes use
the eXtensible Markup Language (XML), developed by the World Wide Web Consortium
(W3C) [51]. Here we briefly discuss the benefits of using XML and its relationship with
other languages such as SGML. SGML (Standard Generalized Markup Language, ISO
8879) is a standard language for defining and using document formats. SGML allows
documents to be self-describing; i.e., they describe their own grammar by specifying the
tag set used in the document and the structural relationships that those tags represent.
However, full SGML contains many optional features that are not needed for Web
applications and has proved to be too complex for current vendors of Web browsers.

The W3C has created an SGML Working Group to build a set of specifications to
make it easy and straightforward to use the beneficial features of SGML on the Web [52].
The resulting subset, XML, retains the key SGML advantages in a language that is designed
to be easier to learn, use, and implement than full SGML. [...] image description scheme
DTD in a highly modular and extensible way.

The image description scheme consists of several basic components: [...] high-level
semantic relationships (e.g., all faces in the image). The image description scheme also
includes one or more object hierarchies to organize the [...] such as information in the
categories of who, what, what action, where, and when.

In addition to the object hierarchy, an entity-relationship model is used to describe
general relationships among object elements. Examples include spatial relationships and
semantic-level relationships (e.g., A is shaking hands with B).

Each object includes one or more associated features. Each object can accommodate
any number of features in a modular and extensible way. The features of an object are
grouped according to the type of information they convey. Multiple abstraction levels of
features can be defined. Each object may be associated with objects in other modalities
through modality transcoding.

Examples of visual features include color, texture, shape, location, and motion.
These can be extracted or assigned automatically or manually. Semantic features
include annotations and semantic-level descriptions in different categories (people,
location, [...], etc.). Media features describe information such as compression
format, bit rate, and file location.

Each feature of an object has one or more associated descriptors. Each feature can
accommodate any number of descriptors in a modular and extensible way. External
descriptors may also be included through linking to an external DTD. For a given
descriptor, the description scheme also provides a link to external extraction code and
similarity matching code.

The unified description scheme may be applied to image, video, and combinations
of multimedia streams in a coherent way. In the case of multimedia, a multimedia stream
is represented as a set of multimedia objects that include objects from the composing
media streams or other multimedia objects. Multimedia objects are organized in object
hierarchies. Relationships among two or more multimedia objects that cannot be expressed
in a tree structure can be described using multimedia entity relation graphs. The tree
structures can be efficiently indexed and traversed, while the entity relation graphs can
model more general relationships.
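
To make the components concrete, the following Python sketch builds a small XML
description containing an object set, one feature with a descriptor, an object hierarchy,
and an entity relation graph. Every element and attribute name here (`object_set`,
`object_node`, `entity_relation_graph`, and so on) is hypothetical and chosen for
illustration. They are not the names defined in the actual description scheme DTD, which
is specified in the cited references.

```python
import xml.etree.ElementTree as ET


def build_description() -> ET.Element:
    """Build a toy image description (all tag names are illustrative)."""
    image = ET.Element("image")

    # Object set: two physical regions and one logical grouping.
    object_set = ET.SubElement(image, "object_set")
    for object_id, kind in (("A", "physical"), ("B", "physical"), ("people", "logical")):
        ET.SubElement(object_set, "object", id=object_id, type=kind)

    # One visual feature with one descriptor attached to object A.
    obj_a = object_set.find("object[@id='A']")
    color = ET.SubElement(ET.SubElement(obj_a, "visual_features"), "color")
    ET.SubElement(color, "descriptor", name="dominant_color").text = "red"

    # Tree-structured relationships go into an object hierarchy.
    hierarchy = ET.SubElement(image, "object_hierarchy", type="semantic")
    root = ET.SubElement(hierarchy, "object_node", object_ref="people")
    ET.SubElement(root, "object_node", object_ref="A")
    ET.SubElement(root, "object_node", object_ref="B")

    # Relationships that do not fit a tree go into an entity relation graph.
    graph = ET.SubElement(image, "entity_relation_graph")
    ET.SubElement(graph, "relation", type="shaking_hands_with", source="A", target="B")
    return image


description = build_description()
xml_text = ET.tostring(description, encoding="unicode")
print(xml_text)
```

The hierarchy is a plain tree that can be walked top-down, while the relation element
carries the pairwise "A is shaking hands with B" link that a tree cannot express.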

Details of the description schemes just mentioned can be found in References 53-55.



B. Multimedia Metasearch Engines 

The preceding self-describing schemes are intuitive, flexible, and efficient. We have started
to develop an MPEG-7 testbed to demonstrate the feasibility of our self-describing
schemes. In our testbed, we are using the self-describing schemes for descriptions of
images and videos that are generated by a wide variety of image-video indexing systems.
In this section, we will discuss the impact of the MPEG-7 standard on a very interesting
research topic, image metasearch engines.

Metasearch engines act as gateways linking users automatically and transparently
to multiple search engines. Most of the current metasearch engines work with text. Our
earlier work on a metasearch engine, MetaSEEk [50], explores the issues involved in
querying large, distributed, online visual information systems. MetaSEEk is designed to
select intelligently and interface with multiple online image search engines by ranking
their performance for different classes of user queries. The overall architecture of
MetaSEEk is shown in Figure 14. The three main components of the system are standard for
metasearch engines; they are the query dispatcher, the query translator, and the display
interface. The procedure for each search is as follows:

Upon receiving a query, the dispatcher selects the target search engines to be queried 
by consulting the performance database at the MetaSEEk site. This database 
contains performance scores of past query successes and failures for each sup- 
ported search engine. The query dispatcher selects only search engines that 
provide capabilities compatible with the user's query (e.g., visual features and/ 
or keywords). 

The query translators then translate the user query to suitable scripts conforming to
the interfaces of the selected search engines.

Finally, the display component uses the performance scores to merge the results
from each search engine and displays them to the user.
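
The dispatch and merge steps above can be sketched as follows. This is not MetaSEEk's
actual implementation: the data layout, the `top_k` cutoff, and in particular the merge
rule (performance score divided by rank position) are assumptions made for illustration.
The engine names and scores are toy values.

```python
from typing import Dict, List, Set, Tuple

Scores = Dict[Tuple[str, str], float]  # (engine, query class) -> performance score


def select_engines(engines: Dict[str, Set[str]], query_caps: Set[str],
                   scores: Scores, query_class: str, top_k: int = 2) -> List[str]:
    """Keep engines that support the query's capabilities, best past performance first."""
    compatible = [name for name, caps in engines.items() if query_caps <= caps]
    compatible.sort(key=lambda name: (-scores.get((name, query_class), 0.0), name))
    return compatible[:top_k]


def merge_results(results: Dict[str, List[str]], scores: Scores,
                  query_class: str) -> List[str]:
    """Merge ranked lists, weighting each hit by engine score and rank (assumed rule)."""
    weighted = []
    for engine, items in results.items():
        weight = scores.get((engine, query_class), 0.0)
        for rank, item in enumerate(items):
            weighted.append((-weight / (rank + 1), engine, item))
    weighted.sort()
    merged: List[str] = []
    for _, _, item in weighted:
        if item not in merged:  # drop duplicates returned by several engines
            merged.append(item)
    return merged


engines = {"A": {"color", "keyword"}, "B": {"keyword"}, "C": {"color", "texture"}}
scores = {("A", "nature"): 2.0, ("B", "nature"): 9.0, ("C", "nature"): 5.0}
targets = select_engines(engines, {"color"}, scores, "nature")
merged = merge_results({"C": ["img1", "img2"], "A": ["img2", "img3"]}, scores, "nature")
print(targets, merged)
```

Engine B is skipped despite its high score because it cannot answer a color query, which
is the compatibility check the dispatcher performs before ranking.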

MetaSEEk evaluates the quality of the results returned by each search engine based
on the user's feedback. This information is used to update the performance database. The
operation of MetaSEEk is severely restricted by the interface limitations of current search
engines. For example, most existing systems can support only query by example, query
by sketch, and keyword search. Results usually are just a flat list of images (with
similarity scores).

With the MPEG-7 standard, search engines will accept not only queries by example and
by sketch but also queries by MPEG-7 multimedia descriptions. Users will be able to submit
desired multimedia content (specified by MPEG-7 descriptions) as the query input to search
engines. In return, search engines will work on a best-effort basis to provide the best
search results. Search engines unfamiliar with some descriptors in the query multimedia
description may just ignore those descriptors. Others may try to translate them to local
descriptors. Furthermore, queries will result in a list of matched multimedia data as well
as their MPEG-7 descriptions. Each search engine will also make available the description
scheme of its content and maybe even proprietary code.

We envision the connection between the individual search engines and the metasearch
engine to be a path for MPEG-7 streams, which will enhance the performance of
metasearch engines. In particular, the ability of the proposed description schemes to
download programs dynamically for feature extraction and similarity matching by using
linking or code embedding will open the door to improved metasearching capabilities.
Metasearch engines will use the description schemes of each target search engine to learn
about the content and the capabilities of each search engine. This knowledge also enables
meaningful queries to the repository, proper decisions to select optimal search engines,
efficient ways to merge results from different repositories, and intelligent display of the
search results from heterogeneous sources.
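
The best-effort behavior described above, in which an engine ignores unfamiliar
descriptors or translates them to local ones, can be sketched as a simple filtering step.
The descriptor names and the translation table are hypothetical, and a real translation
would also have to convert the descriptor values, which this sketch passes through
unchanged.

```python
from typing import Any, Dict, List, Set, Tuple


def interpret_query(query: Dict[str, Any], supported: Set[str],
                    translations: Dict[str, str]) -> Tuple[Dict[str, Any], List[str]]:
    """Split a query description into usable local descriptors and ignored ones."""
    usable: Dict[str, Any] = {}
    ignored: List[str] = []
    for name, value in query.items():
        if name in supported:
            usable[name] = value
        elif translations.get(name) in supported:
            usable[translations[name]] = value  # value conversion omitted (assumption)
        else:
            ignored.append(name)
    return usable, ignored


query = {
    "color_histogram": [0.2, 0.8],
    "texture_tamura": [0.5],
    "motion_trajectory": [1, 2],
}
usable, ignored = interpret_query(
    query,
    supported={"color_histogram", "texture_wavelet"},
    translations={"texture_tamura": "texture_wavelet"},
)
print(usable, ignored)
```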

VI. DISCUSSION

Multimedia search and retrieval involves multiple disciplines, including image processing,
computer vision, databases, information retrieval, and user interfaces. The content types
contained in multimedia data can be very diverse and dynamic. This chapter focuses on
multimedia content structuring and searching at different levels. It also addresses the
interoperable representation for content description. The impact of the emerging standard,
MPEG-7, is also discussed from the perspective of developing metasearch systems.

Many other important research issues are involved in developing a successful multi- 
media search and retrieval system. We briefly discuss several notable ones here. First, 
user preference and relevance feedback are very important and have been used to improve 
the search system performance. Many systems have taken into account user relevance 
feedback to adapt the query features and retrieval models during the iterated search process 
[56-59]. 

Second, content-based visual query poses a challenge because of the rich variety and
high dimensionality of features used. Most systems use techniques related to prefiltering to
eliminate unlikely candidates in the initial stage and to compute the distance of
sophisticated features on a reduced set of images [60]. A general discussion of issues
related to high-dimensional indexing for multimedia content can be found in Reference 61.
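
The prefiltering idea can be sketched as a two-stage search: a cheap distance prunes the
database, and the expensive distance is computed only on the shortlist. The specific
distances used here (difference of means as the cheap filter, Euclidean distance as the
"sophisticated" one) and the parameter names are assumptions chosen to keep the example
small.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Tuple[float, ...]
Distance = Callable[[Vector, Vector], float]


def mean_gap(a: Vector, b: Vector) -> float:
    """Cheap filter: compare only the mean feature value."""
    return abs(sum(a) / len(a) - sum(b) / len(b))


def two_stage_search(query: Vector, database: Sequence[Vector], cheap: Distance,
                     full: Distance, keep: int, top_k: int) -> List[Vector]:
    """Prefilter with the cheap distance, then rank the shortlist with the full one."""
    shortlist = sorted(database, key=lambda item: cheap(query, item))[:keep]
    return sorted(shortlist, key=lambda item: full(query, item))[:top_k]


database = [(0.0, 0.0), (1.0, 1.0), (0.9, 1.1), (5.0, 5.0), (2.0, 0.0)]
result = two_stage_search((1.0, 1.0), database, mean_gap, math.dist, keep=3, top_k=2)
print(result)
```

Note that (2.0, 0.0) passes the cheap filter but is ranked last by the full distance, which
is why the second stage is still needed.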

Investigation of search and retrieval for other types of multimedia content has also 
become increasingly active. Emerging search engines include those for music, audio clips, 
synthetic content, and images in special domains (e.g., medical and remote sensing). Wold 
et al. [62] developed a search engine that matches similarities between audio clips based 
on the feature vectors extracted from both the time and spectral domains. Paquet and 
Rioux [63] presented a content-based search engine for 3D VRML data. Image search 
engines specialized for remote sensing applications have been developed [33,34] with 
focus on texture-based search tools. 

Finally, a very challenging task in multimedia search and retrieval is performance 
evaluation. The uncertainty of user need, the difficulty in obtaining the ground truth, and 
the lack of a standard benchmark content set have been the main barriers to developing 
effective mechanisms for performance evaluation. To address this problem, MPEG-7 has 
incorporated the evaluation process as a mandatory part of the standard development pro- 
cess [64]. 